prompt engineering

Effective context engineering for AI agents \ Anthropic

Well-designed context is essential for building capable AI agents.

As the number of tokens in the context window grows, the model's ability to recall information from that context accurately and efficiently declines.

Context Rot: How Increasing Input Tokens Impacts LLM Performance | Chroma Research

The degree of degradation varies by model, but the behavior itself is common to all models.

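LLMs are built on the Transformer architecture, which lets every token attend to every other token across the entire context. As a result, n tokens give rise to n² pairwise relationships. As a model's context length grows, a trade-off emerges between context size and attention focus.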
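To make the n² growth concrete, here is a minimal sketch that counts the token pairs full self-attention has to relate. The function name `pairwise_attention_pairs` is made up for this note, and the count assumes plain full attention with no sparsity or windowing.

```python
def pairwise_attention_pairs(n_tokens: int) -> int:
    """Count (query, key) pairs when every token attends to every token.

    Assumption: plain full self-attention, so n tokens yield n * n pairs.
    """
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return n_tokens * n_tokens


if __name__ == "__main__":
    # A 10x longer context means 100x more pairwise relationships.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} tokens -> {pairwise_attention_pairs(n):>15,} pairs")
```

Causal (decoder-only) attention only looks backward, which roughly halves the count, but the growth is still quadratic in n.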

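Techniques such as position encoding interpolation let a model handle longer sequences by adapting them to the smaller context it was originally trained on. Trade-off: the model's understanding of token position degrades.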
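The note above does not say which interpolation scheme is meant. As one possible reading, the sketch below assumes simple linear position interpolation: position indices of a long sequence are rescaled so they fall inside the range seen during training. The function name `interpolate_positions` and the toy lengths are illustrative only.

```python
def interpolate_positions(seq_len: int, trained_len: int) -> list[float]:
    """Rescale positions 0..seq_len-1 into the trained range [0, trained_len).

    Assumption: linear interpolation, i.e. each index is multiplied by
    trained_len / seq_len when the sequence exceeds the trained length.
    """
    if seq_len <= 0 or trained_len <= 0:
        raise ValueError("lengths must be positive")
    scale = min(1.0, trained_len / seq_len)
    return [i * scale for i in range(seq_len)]


if __name__ == "__main__":
    # Toy numbers: a model trained on 4 positions is given 8 tokens.
    positions = interpolate_positions(seq_len=8, trained_len=4)
    print(positions)
    # Neighboring tokens are now 0.5 apart instead of 1.0.
    print(positions[1] - positions[0])
```

The shrinking gap between neighboring positions is one way to picture the trade-off: the longer sequence fits, but adjacent tokens become harder to tell apart by position.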

Models remain highly capable at longer contexts, but their precision on information retrieval and long-range reasoning is lower than on shorter contexts.

LLMのプロンプトエンジニアリング (Prompt Engineering for LLMs) - O'Reilly Japan

Insights gained while implementing GitHub Copilot

Prompt engineering overview - Claude Docs

Tools

https://github.com/features/models

Define the specification

2024-03-09 Claude 3 Opus can generate an adventure game in the style of 「オホーツクに消ゆ」, which is difficult with GPT-4 - ABAの日誌

2023-09-27 Extra edition: How to reproduce in one shot the results you once achieved through prompt management - CNET Japan

Asking an LLM to write the prompt is interesting, but is it actually useful?

https://www.promptingguide.ai/jp

GitHub - dair-ai/Prompt-Engineering-Guide: Guides, papers, lecture, and resources for prompt engineering

Reddit - Dive into anything

https://arxiv.org/abs/2205.11916 Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
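A minimal sketch of how the Zero-shot-CoT template from the abstract could be assembled. Only the trigger phrase "Let's think step by step" comes from the abstract. The two-step flow (elicit reasoning, then ask for the final answer), the answer trigger wording, and the `complete` callable standing in for a model call are all assumptions made for this note, not a real library API.

```python
from typing import Callable

# Trigger phrase quoted in the abstract.
REASONING_TRIGGER = "Let's think step by step."
# Assumed wording for the answer-extraction step; adjust per task.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def build_reasoning_prompt(question: str) -> str:
    """Step 1 prompt: append the trigger where the answer would begin."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"


def build_answer_prompt(question: str, reasoning: str) -> str:
    """Step 2 prompt: feed the generated reasoning back, then ask for the answer."""
    return f"{build_reasoning_prompt(question)} {reasoning.strip()}\n{ANSWER_TRIGGER}"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> str:
    """Run both steps with a caller-supplied completion function.

    `complete` is a hypothetical stand-in: it takes a prompt string and
    returns the model's continuation as a string.
    """
    reasoning = complete(build_reasoning_prompt(question))
    return complete(build_answer_prompt(question, reasoning)).strip()


if __name__ == "__main__":
    print(build_reasoning_prompt("I have 3 apples and buy 2 more. How many do I have?"))
```

Because no few-shot exemplars are involved, the same two templates can be reused across tasks; only the answer trigger might need to change with the expected answer format.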

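@odashi_t: If you are going to research prompts, I think the right direction is to look for general techniques that can be expected to stay effective as long as LLMs are language models, such as "let's think step-by-step.", rather than one-off hacks that will stop working in six months.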